Data Analytics Part-1

Why do I need to learn Data Analytics?

The process of Data Analytics



  1. What are your objectives?
  2. Data Acquisition
  3. Cleaning Data
  4. Exploratory Analysis
  5. Modeling and validation
  6. Visualizing Results
  7. Report

What is Pandas?

Importing Pandas to Python

Similar to importing turtle library, the following line imports pandas and creates a pandas object with the name pd.

        import pandas as pd
    

From now and on, you can use the object pd to perform pandas operations.



Data Acquisition: 1. Read data from Excel file

ndas allows you to access various sources of data including Excel, Comma Separated Values (CSV), tables, text, database, web and more.

The general syntax to read data from excel file is as follows:




Data Acquisition: 2. Read data from CSV file.

Read data from a flat file - Comma Separated Values (CSV).

	    Variable = pd.read_csv("Filename.csv", delimiter=",")
	

Example

Using the data below:

Now, the following command will read the data file and display the content in a partcular format:

	from google.colab import files
        uploaded = files.upload()
        import pandas as pd
        emirates = pd.read_csv("uae.csv",delimiter=',')
        emirates
	

Data Acquisition: 3. Read data from Web page (HTML).

Read data from a web page

	    Variable = pd.read_html("url")
	

Note:read_html returns a list of DataFrame objects

Using the webpage content below:


Now, the following command will read the webpage and display the content in a partcular format:

	import pandas as pd
        url = 'https://en.wikipedia.org/wiki/Demographics_of_the_United_Arab_Emirates'
        dfs = pd.read_html(url)
        dfs[3]

	

Pandas Data Frame

DataFrame: DataFrame is a two (or more) dimensional data structure – basically a table with rows and columns. The columns have names and the rows have indexes.


Let us examine the pandas data frame in more detail.

Data Cleaning: 1. Skipping unnecessary rows and columns

Pandas allows you to skip unwanted rows when reading the data.

	    data2 = pd.read_excel("../Data/HigherEducationOpenData.xlsx",sheetname="2015 − 2016",skiprows=4)
	

We basically added comma then skiprows=4 to the original line that reads data from the excel file. Store the result in a new data frame as data2.


Data Cleaning: 2. Working with missing data

In many cases you will find your self dealing with missing data. This is indicated by NaN in the cells or just empty cell. How to deal with such situation.

  1. Drop missing observations. This will delete all rows that contain any missing data: df_no_missing = df.dropna()
  2. Drop rows where all cells in that row is NA. This will delete rows that are missing all its data. df_cleaned = df.dropna(how='all')
  3. Drop column if they only contain missing values To force dropna to work on columns you need to add axis=1. df.dropna(axis-1,how='all')
  4. Fill in missing data with zeros This will replace NaN with 0. df.fillna(0)

Note:
keyword how="all" to indicate all cells have missing data
keyword how="any" to indicate any cell have missing data

Example: Filling in missing data

In this example we have a simple data set that contains students score pre and post test. The data set contains number of missing data of students who did not attend the exam. These missing data are appearing as NaN. Our simple task is to replace NaN with average (mean) score of the column.

The data set is included in the excel file "Grades.xlsx" that we used in the previous example. It located in the sheet "PrePost".

Here is the code needed to read the sheet.

	from google.colab import files
        uploaded = files.upload()
        import pandas as pd
        data = pd.read_excel("Grades.xlsx",sheet_name="PrePost")
        data
	

Fill all the missing values with 0 permanently using the below example.

    data.fillna(0, inplace = True)
    data
	

Note: use inplace = True to make the changes permenent.

Data Cleaning: 3. Renaming columns

When reading data from a data source you may get some columns with long or unappropriated names. Here you will learn how to rename a column.


Note that inplace=True is needed to ensure that the changes you may are permanent. Otherwise, this line we display the new column name and if you try to access the same DataFrame, then you will get the old name.

Example: Renaming columns

Following command will rename column Question1, Question2, Question3, and Question4 to Q1, Q2, Q3, Q4.

	data.rename(columns={"preTestScore":"preTest","postTestScore":"postTest"}, inplace = True)
        data

	

Summary Command




For more details, please contact me here.
Date of last modification: 2021